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Efficient Fuzzy Match for Evaluating Data Records 
Field of the Invention 

5 The present invention concerns a system for evaluating input data records based 

on the contents of a reference table to determine a closeness of the input record and 
entries in the reference table. 

Background Art 

Decision support analysis on data warehouses influences important business 
10 decisions; therefore, accuracy of such analysis is important. However, data received at the 
data warehouse fi"om external sources usually contains errors, e.g., spelling mistakes, 
inconsistent conventions across data sources, missing fields. Consequently, a significant 
amount of time and money are spent on data cleaning, the task of detecting and 
correcting errors in data. A prudent alternative to the expensive periodic data cleaning of 
15 an entire data warehouse is to avoid the introduction of errors during the process of 
adding new data into the warehouse. This approach requires input tuples to be validated 
and corrected before they are added to the database. 

A known technique validates incoming tuples against reference relations 
consisting of known-to-be-clean tuples in the database. The reference relations may be 

20 internal to the data warehouse (e.g., customer or product relations) or obtained fi-om 
external sources (e.g., valid address relations from postal departments). An enterprise 
maintaining a relation consisting of all its products may ascertain whether or not a sales 
record fi:-om a distributor describes a valid product by matching the product attributes 
(e.g., Part Number and Description) of the sales record with the Product relation; here, the 

25 Product relation is the reference relation. If the product attributes in the sales record 

match exactly with a tuple in the Product relation, then the described product is likely to 

be valid. However, due to errors in sales records, often the input product tuple does not 

match exactly with any in the Product relation. Then, errors in the input product tuple 

need to be corrected before it is stored. The information in the input tuple is still very 

30 usefiil for identifying the correct reference product tuple, provided the matching is 

Express Mail Label No. 

I hereby certify that this paper is (Ming deposited today 
^ - ^ 1 with the U.S. Postal Service as Exifross Mail addressed 

MbiU1555.1 tinal 1 to the Assistant Commissioner fwPitBrtts. P.O. Box 1450, 

Alexandria. V^^13-J450 / 



resilient to errors in the input tuple. Error-resilient matching of input tuples against the 
reference table is referred to as a fuzzy match operation. 

Suppose an enterprise wishes to ascertain whether or not the sales record 
describes an existing customer by fuzzily matching the customer attributes of the sales 

5 record against the Customer relation. The reference relation. Customer, contains tuples 
describing all current customers. If the fuzzy match returns a target customer tuple that is 
either exactly equal or "reasonably close" to the input customer tuple, then the input tuple 
would have been validated or corrected. A notion of closeness between tuples is usually 
measured by a similarity function. If the similarity between an input customer tuple and 

10 its closest reference tuple is higher than some threshold, then the correct reference tuple is 
loaded. Otherwise, the input is routed for further cleaning before considering it as 
referring to a new customer. A fuzzy match operation that is resiUent to input errors can 
effectively prevent the proliferation of fuzzy duplicates in a relation, i.e., multiple tuples 
describing the same real world entity. See Hernandez et al "The merge/purge problem 

15 for large databases" in Proceedings of the ACM SIGMOD, San Jose, CA May 1995. 

Several methods for approximate string matching over dictionaries or collections 
of text documents have been proposed (e.g., Gravano et al "Approximate string joins in a 
database (almost) for Free". In Proceedings of VLDB, Roma, Italy, September 11- 
14,2001 and Navarro et al "Indexing methods for approximate string matching." In IEEE 
20 Data Engineering Bulletin, 24(4): 19-27,2001.). All of the above methods use edit 
distance as the similarity function, not considering the crucial aspect of differences in 
importance of tokens while measuring similarity. 

Approximate string matching methods [e.g., R. Baeza- Yates and G. Navarro. A 
practical index for text retrieval allowing errors. In R. Monge, editor. Proceedings of the 

IS XXIII Latin American Conference on Informatics (CLEIV?), Valparaiso, Chile, 1997. and 
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q- 
grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern 
Matching (CPM*2000), LNCS 1848, 2000.] preprocess the set of dictionary/text strings to 
build q-gram tables containing tuples for every string s of length q that occurs as a 

30 substring of some reference text string; the record also consists of the list of identifiers (or 
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locations) of strings of which s is a substring. The error tolerant index relation ETI we 
build from the reference relation is similar in that we also store q-grams along with the 
list of record identifiers in which they appear, but the ETI (i) is smaller than a full q-gram 
table because we only select (probabilistically) a subset of all q-grams per tuple, and (ii) 
5 encodes column-boundaries specific to relational domains. 

The information retrieval community has successfully exploited inverse document 
frequency (IDF) weights for differentiating the importance of tokens or words. However, 
the IR application assumes that all input tokens in the query are correct, and does not deal 
with errors therein. Only recently, some search engines (e.g., Google's "Did you mean?" 

10 feature) are beginning to consider even simple spelling errors. In the fuzzy match 
operation, we deal with tuples containing very few tokens (many times, around 10 or less) 
and hence cannot afford to ignore erroneous input tokens, as they could be crucial for 
differentiating amongst many thousands of reference tuples. For example, the erroneous 
token 'beoing' in the input tuple [beoing corporation, Seattle, wa, NULL] is perhaps the 

15 most useful token for identifying the target from among all corporation records of 
companys in the Seattle area. Clustering and reference matching algorithms [e.g., W. 
Cohen. Integration of heterogeneous databases without common domains using queries 
based on textual simileinty. In Proceedings ofACMSIGMOD, Seattle, WA, June 1998. 

W. Cohen. Data integration using similarity joins and a word-based information 
20 representation language. ACM Transactions on Information Systems, 18(3):288-321, 
July 2000.E. Cohen and D. Lewis. Approximating matrix multiplication for pattem 
recognition tasks. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1997.] 
using the cosine similarity metric with IDF weighting also share the limitation of ignoring 
erroneous input tokens. Further, efficiency is improved by choosing probabilistically a 
25 subset of tokens from each document under the correct input token assumption. 

As discussed earlier, ahnost all solutions for the nearest neighbor problem are 
targeted at data in Euclidean/normed spaces and hence are inapplicable to the present 
invention. See V. Gaede and O. Gunther. ''Multidimensional access methods." ACM 
Computing Surveys, 30(2): 170— 231, 1998. There has been some recent work on general 
30 metric spaces [e.g., P. Ciaccia, M. Patella, P. Zezula. M-tree: An efficient access method 
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for similarity search in metric spaces. VLDB 1997. G. Navarro. Searching in metric 
spaces by spatial approximation. The VLDB Journal, ll(l):28-46, 2002. Their 
complexity and performance are not suitable for the high-throughput systems of interest 
here. Moreover, many of these solutions cannot be deployed easily over current data 
5 warehouses because they require specialized index structures (e.g., M-trees, tries) to be 
persisted. 

Some recent techniques address a related problem of eliminating "fuzzy 
duplicates" in a relation by using a similarity function by identifying highly similar tuples 
as duplicates. Some are based on the use of edit distance [e.g., M. Hernandez and S. 

10 Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM 
SIGMOD, San Jose, CA, May 1995.] and some on cosine similarity with IDF weights 
[e.g., W. Cohen. Data integration using similarity joins and a word-based information 
representation language. ACM Transactions on Information Systems, 18(3):288-321, 
July 2000. ]. Such techniques are designed for use in an offline setting and do not satisfy 

15 the efficiency requirements of an online fuzzy match operation where input tuples have to 
be quickly matched with target reference tuples before being loaded into the data 
warehouse. A complementary need is to first clean a relation by eliminating fuzzy 
duplicates and then piping further additions through the fuzzy match operation to prevent 
introduction of new fuzzy duplicates. 

20 

Summary of the Invention 

An exemplary system provides a robust and efficient fuzzy match process that is 

applicable across a wide variety of domains. The system provides a strong foundation for 

adding domain-specific enhancements. Most data warehouses are built atop database 

25 systems. Consequently, in addition to robustness and efficiency the disclosed fuzzy match 

solution is implemented over standard database systems without requiring the persistence 

of complex data structures. 

An important ingredient of the fuzzy match operation is the similarity function 
used for comparing tuples. In typical application domains, the similarity function must 
30 definitely handle string-valued attributes and .possibly even numeric attributes. For 
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string-valued attributes, defining similarity and performing fiizzy matching is 
challenging. Given the similarity function and an input tuple, one goal of the fiizzy match 
operation is to return the reference tuple — a tuple in the reference relation — which is 
closest to the input tuple. An alternative return the closest K reference tuples enabling 
users to choose one among them as the target, rather than the closest. An additional 
alternative is to output K or fewer tuples whose similarity to the input tuple exceeds a 
user-specified minimum similarity threshold. This formulation is similar to a nearest 
neighbor problem, but in prior art nearest neighbor analyses the domain is typically a 
Euclidean (or other normed) space with well-behaved similarity functions. In the present 
system, the data are not represented in "geometric" spaces, and it is hard to map them into 
such spaces because the similarity function is relatively complex. 

The system uses a novel fuzzy match similarity function that explicitly considers 
DDF token weights and input errors while comparing tuples. One implementation uses an 
error tolerant index and a probabilistic process for efficiently retrieving the K reference 
tuples closest to the input tuple, according to a fuzzy match similarity function. The 
exemplary process can be extended to use specialized (possibly domain-specific) token 
weight functions instead of IDF token weights. 

These and other objects, advantages and features of the invention are more fully 
understood from a review of an exemplary embodiment of the invention which is 
described in conjunction with the accompanying claims. 

Brief Description of the Drawings 

Figure 1 is a schematic depiction of a representative computer system for 
implementing an exemplary embodiment of the invention; 

Figure 2 is a flow chart indicating a process of evaluating input records or tuples 
based on the contents of a reference table; 

Figure 3 is a depiction showing a computation of edit distance between two 
different tokens; 
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Figure 4 is a depiction of a process for creating q-gram vectors from an input tuple 
having multiple tokens in different columns; 

Figure 5 is a more detailed flowchart illustrating the processing steps of an 
exemplary embodiment of the present invention; 

5 Figures 6A and 6B show a construction of a hash table for tabulating of a 

reference score for reference records based on an analysis of an input record. 

Exemplary embodiment for practicing the invention 

Figure 1 depicts an exemplary data processing system for practicing the disclosed 
10 invention utilizing a general purpose computer 20. A data mining software component 
that executes on the computer 20 accesses a database to extract data records stored within 
that database. An application program 36 either executing on the computer 20 or in 
communications with the computer 20 by means of a communications hnk such as a 
network 51 makes requests of a data mining engine. 

15 As seen by referring to Figure 1 the computer 20 includes one or more processing 

units 21, a system memory 22, and a system bus 23 that couples various system 
components including the system memory to the processing unit 21. The system bus 23 
may be any of several types of bus structures including a memory bus or memory 
controller, a peripheral bus, and a local bus using any of a variety of bus architectures. 

20 The system memory includes read only memory (ROM) 24 and random access 

memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic 
routines that help to transfer information between elements within the computer 20, such 
as during start-up, is stored in ROM 24. 

The computer 20 further includes a hard disk drive 27 for reading from and 
25 writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to 
a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to 
a removable optical disk 31 such as a CD ROM or other optical media. The hard disk 
drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system 
bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical 
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drive interface 34, respectively. The drives and their associated computer-readable media 
provide nonvolatile storage of computer readable instructions, data structures, program 
modules and other data for the computer 20. Although the exemplary environment 
described herein employs a hard disk, a removable magnetic disk 29 and a removable 
5 optical disk 31, it should be appreciated by those skilled in the art that other types of 
computer readable media which can store data that is accessible by a computer, such as 
magnetic cassettes, flash memory cards, digital video disks, BemouUi cartridges, random 
access memories (RAM), read only memories (ROM), and the like, may also be used in 
the exemplary operating environment. 

10 A number of program modules including the data mining software component 12 

may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, 
including an operating system 35, one or more application programs 36, other program 
modules 37, and program data 38. A user may enter commands and information into the 
computer 20 through input devices such as a keyboard 40 and pointing device 42. Other 

15 input devices (not shown) may include a microphone, joystick, game pad, satellite dish, 
scanner, or the like. These and other input devices are often connected to the processing 
unit 21 through a serial port interface 46 that is coupled to the system bus, but may be 
connected by other interfaces, such as a parallel port, game port or a universal serial bus 
(USB). A monitor 47 or other type of display device is also connected to the system bus 

20 23 via an interface, such as a video adapter 48. In addition to the monitor, personal 
computers typically include other peripheral output devices (not shown), such as speakers 
and printers. 

The computer 20 may operate in a networked environment using logical 
connections to one or more remote computers, such as a remote computer 49. The 

25 remote computer 49 may be another personal computer, a server, a router, a network PC, 
a peer device or other common network node, and typically includes many or all of the 
elements described above relative to the computer 20, although only a memory storage 
device 50 has been illustrated in Figure 1. The logical connections depicted in Figure 1 
include a local area network (LAN) 51 and a wide area network (WAN) 52. Such 

30 networking environments are commonplace in networks, intranets and the Internet. 
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When used in a Ian networking environment, the computer 20 is connected to the 
local network 51 through a network interface or adapter 53. When used in a wan 
networking environment, the computer 20 typically includes a modem 54 or other means 
for establishing communications over the wide area network 52, such as the internet. The 
5 modem 54, which may be internal or external, is connected to the system bus 23 via the 
serial port interface 46. In a networked environment, program modules depicted relative 
to the computer 20, or portions thereof, may be stored in the remote memory storage 
device. It will be appreciated that the network connections shown are exemplary and 
other means of establishing a communications link between the computers may be used. 

10 

Fuzzy Match 

In accordance with an exemplary system, the similarity between an input tuple 
110 (Figure 2) and a reference tuple contained within a reference table 112 is the cost of 
transforming the input tuple into the reference tuple — ^the less the cost, the higher the 

15 similarity. An exemplary process depicted in Figure 2 first checks to see if the input tuple 
110 is an exact match with a tuple in the reference table 1 12. If this test 1 14 succeeds the 
tuple is entered. If the test 1 14 does not succeed, a fuzzy match process 1 16 is performed. 
The system tests 1 18 an output from this fuzzy match to determine if a threshold criteria 
has been satisfied and if the threshold was satisfied, the input tuple is corrected 

20 (presumably one of the record fields that is tested was incorrect) and loaded into the 
database. A user interface for initiating the exemplary process presents a user a means of 
choosing the reference table 112 and a source of one or more input tuples. Once the 
source and reference tables are determined, the interface allows the user to pick data 
attributes on which to check records for similarity. Not all text attributes must be selected 

25 during this process. 

Prior art approaches addressing the fuzzy match operation either adopt proprietary 
domain-specific functions (e.g., TriUium's reference matching operation for the address 
domain) or use the string edit distance function for measuring similarity between tuples. 
A limitation of the edit distance is illustrated by the following example. The edit distance 
30 function would consider the input tuple 13 in Table 2 to be closest to R2 in Table 1, even 
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though it is known that the intended target is Rl . Edit distance fails because it considers 
transforming 'corporation' to 'company' more expensive than transforming 'hoeing' to 
'bon.' However, we know that 'hoeing' and '98004' are more informative tokens than 
'corporation' and so replacing 'corporation' with 'company' should be considered 
5 cheaper than replacing 'hoeing' with 'bon' and '98004' with '98014.' In yet another 
example, note that the edit distance considers 14 closer to R3 than to its target Rl. This is 
because it fails to capture the notion of a token or take into account the common error of 
token transposition. 

Table 1: Organization Reference Relation 



10 



ID 


Org. Name 


City 


State 


Zipcode 


RI 


Boeing Company 


Seattle 


m 


98004 


R2 


Bon Corporation 


Seattle 


WA 


98014 


R3 


Companions 


Seattle 


WA 


98024 



Table 2: Input Organization Tuples 



Id 


Org. Name 


City 


State 


Zipcode 


11 


Beoing Company 


Seattle 


WA 


98004 


12 


Beoing Co. 


Seattle 


WA 


98004 


13 


Boeing Corporation 


Seattle 


WA 


98004 


14 


Company Beoing 


Seattle 


NULL 


98014 



The exemplary system uses a novel fuzzy match similarity (fins) function, which 
15 views a string as a sequence of tokens and recognizes the varying "importance" of tokens 
by explicitly associating weights quantifying their importance. Tuples matching on high 
weight tokens are more similar than tuples matching on low weight tokens. The system 
uses inverse document fi-equency (DDF) weights from the IR literature for quantifying the 
notion of token importance. Informally, the importance of a token decreases with its 
20 . fi-equency, which is the number of times a token occurs in the reference relation. Even 
though the approach of weight association is common in the IR literature, the effective 
use of token weights in combination with data entry errors (e.g., spelling mistakes, 
missing values, inconsistent abbreviations) has not been considered in the prior art. 

The notion of similarity between two tuples depends on the minimum cost of 
25 "transforming" one tuple into the other through a sequence of transformation operations 
(replacement, insertion, and deletion of tokens) where the cost of each transformation 
operation is a function of the weights of tokens involved. For example, it may be cheaper 
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to replace the token *corp' with 'corporation* than to replace 'corporal' with 'corporation' 
even though edit distances suggest otherwise. This notion of similarity based on 
transformation cost is similar to edit distance except that the system operates on tokens 
and explicitly consider their weights. 

5 A goal of the exemplary fuzzy match process is to efficiently retrieve the K 

reference tuples closest to an input tuple. It is well-known that efficiently identifying the 
exact K nearest neighbors even according to the Euclidean and Hamming norms in high- 
dimensional spaces is computationally hard. Since the Hamming norm is a special case of 
the edit distance obtained by allowing only replacements, the identification of the exact. 

10 closest K matches according to the disclosure fuzzy match similarity — ^which generalizes 
edit distance by incorporating token weights — is essentially hard. 

The system adopts a probabilistic approach where the goal is to return the closest 
K reference tuples with high probability. The reference relation is pre-processed to build 
an index relation, called an error tolerant index (ETI) relation, for retrieving at run time a 

15 . small set of candidate reference tuples, which are then compared with the input tuple. 
This retrieval process is probabilistically safe because it retrieves (with high probability) 
a superset of the K reference tuples closest to the input tuple. It is efficient because the 
superset is significantly (often by several orders of magnitude) smaller than the reference 
relation. The index relation ETI is implemented and maintained as a standard relation or 

20 table, and hence can be deployed over current operational data warehouses. 

The Similarity Function 

An understanding of the fuzzy match similarity (fms) function for comparing tuples is 
based on a few definitions. 

Edit Distance: The edit distance ed(si, S2) between two strings si and S2 is the minimum 
25 number of character edit operations (delete, insert, and substitute) required to transform Si 
into S2, normaUzed by the maximum of the lengths of si and S2. For the example shown in 
Figure 3 the edit distance between the strings 'company' and 'corporation' is 7/11 »0.64, 
and the sequence of edit operations is shown. Vertical lines indicate either exact matches 
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(cost is 0) or substitutions (cost is 1). Characters that are deleted or inserted have a unit 
cost. 

Reference Relation: Let R[rirf, Ai,...,An] be a reference relation where Ai denotes the i^^ 
5 column. Assume that each Aj is a string-valued attribute (e.g., of type varchar). Also 
assume that tid (for tuple identifier) is a key of R. Refer to a tuple whose tid attribute 
assumes value r as the tuple r. Use v[i] to denote the value ai in the tuple v[r, ai,. . .,an]. 

Tokenization: Let tok be a tokenization function which splits a string s into a set of 
10 tokens, tok(s), based on a set of delimiters (say, the white space characters). For example, 
tok(v[l]) of the tuple v = [Rl, Boeing Company, Seattle, WA, 98004] is {boeing, 
company}. Observe that the function ignores case while generating tokens. For tokens 
generated from attribute values of tuples, associate the column property — the column 
from which a token originates. For example, the column property of tokens in tok(v[c6>/]) 
15 is coL Consequently, the token 'madison' in the name column of a customer relation is 
considered different from the token *madison' in the city column. The token set tok(v) is 
the multiset union of sets tok(ai ),..., tok(an) of tokens from the tuple v[r, ai,...,an]. That is, 
if a token t appears in multiple columns, the process retains one copy per column in 
tok(v), distinguishing each copy by its column property. One says that a token t is in 
20 tokfv) if t is a member of some tok(ai), for 1 <i <n. 

Weight Function: The exemplary system adapts the IDF weight fimction to the relational 
domain by treating each tuple as a document of tokens. The motivation for this definition 
is clear from the following example - one expects the weight of token 'corporation' in the 
25 organization-name column to be less than that of 'united' since corporation is a frequent 
token in that column. Let the frequency of token t in column i, denoted freq(t, i), be the 
number of tuples v in R such that tok(v[i]) contains t. The IDF value, IDF(t, i), of a token 
t with respect to the i'^ colunm in the schema of R is computed as follows, when freq(t, i) 
>0, 

^^{t,i)= IDF (/,/)= log I ^ ' 

freq 0,1) 
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For a token t whose frequency in column i is 0, the system presumes that the token t is an 
erroneous version of some token in the reference tuple. Since the token to which it 
corresponds is unknown, the system defines the weight w(t, i) to be the average weight of 
all tokens in the i^^ column of relation R. For clarity in presentation, when the column 
5 property of a token is evident from the context, it is appropriate to use w(t) to denote w(t. 

Fuzzy Similarity Function {fms) 

The exemplary system employs a fuzzy similarity function which considers the 
following transformation operations: token replacement, token insertion, and token 
10 deletion. Each operation is associated with a cost that depends on the weight of the token 
being transformed. Let u and v be two tuples having a schema R[Ai,, . .,An]. Consider the 
case where u is an input tuple and v is a reference tuple, and the system is interested in 
the cost of transforming u into v. 

(i) Token replacement: The cost of replacing a token ti in tok(u[i]) by token t2 from 
15 tok(v[i]) is ed(ti,t2)-w(ti,i). If ti and ii are from different columns, the cost is infinite. 

(ii) Token insertion: The cost of inserting a token t into u[i] is Cins*w(t, i), where the 

token insertion factor Cins is a constant between 0 and 1 . 

(iii) Token deletion: The cost of deleting a token t from u[i] is w(t,i). 

Observe that the costs associated with inserting and deleting the same token may be 
20 different. This asymmetry is useful, since it may be more likely for tokens to be left out 
during data entry than it is for spurious tokens to be inserted. Therefore, absence of 
tokens is not as heavily penalized. 

The system ignores the tid attribute while comparing tuples. Transforming u into v 
requires each column u[i] to be transformed into v[i] through a sequence of 
25 transformation operations, whose cost are defined to be the sum of costs of all operations 
in the sequence. The transformation cost tc(u[i], vfij) is the cost of the minimum cost 
transformation sequence for transforming u[i] into v[i]. The cost tc(u, v) of transforming 
u into V is the sum over all columns i of the costs tc(u[i], v[i]) of transforming u[i] into 
v[i]. 
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tc(u,v)= Y,tc(u[ilv[i]) 

i 

The minimum transfomiation cost tc(u[i], v[i]) can be computed using the 
dynamic programming algorithm used for edit distance computation. See T. F. Smith and 
M. S. Waterman. Identification of common molecular subsequences. Journal of 
5 Molecular Biology, 147:195-197, 1981. 

Consider the input tuple u[Beoing Corporation, Seattle, WA, 98004] in Table 2 
and the reference tuple v[Boeing Company, Seattle, WA, 98004]. The minimum cost 
transformation of u[l] into v[l] requires two operations - replacing 'beoing' by 'hoeing* 
and replacing 'corporation' by 'company'. The function tc(u[l], v[l]) is the sum of costs 
10 of these two operations; assuming unit weights on all tokens, this is 0.97 by adding 0.33 
for replacing 'beoing' with 'boeing' which are at an edit distance 0.33, and 0.64 for 
replacing the token 'corporation' 120 (Figure 3) with the token 'company' 122 which are 
at an edit distance 0.64. In this example, only tc(u[l], v[l]) is nonzero among column- 
wise transformation costs. 

15 Definition of fins: 

The fuzzy match similarity fiinction fms(u, v) between an input tuple u and a 
reference tuple v is defined in terms of the transformation cost tc(u, v). Let w(u) be the 
sum of weights of all tokens in the token set tok(u) of the input tuple u. Similarity 
between u and v is defined as: 

fms (m , v) = 1 - min( — ^— ^ — -,\ .0) 
w(u) 

In the above example involving 13 and Rl, w(I3) = 5.0 there are five tokens in tok(Il) and 
the weight of each token is 1.0. Therefore, fiiis(I3, Rl) = 1 - 0,97/5.0 = 0.806. It is usefiil 
to define fins asymmetrically because it is beHeved the cost of transforming a dirty input 
tuple into a clean reference tuple is different from the reverse transformation. 

25 Edit Distance and fins 

For a broad subclass of errors, it is usefiil to compare the weight assignment 
strategy implicitly adopted by the edit distance ed with that of the fiizzy match similarity 
fins, to isolate scenarios when they agree or disagree on fuzzy match. The comparison 
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also justifies, although only informally, the belief that fins is the more appropriate choice 
in practice. 

The exemplary system considers the subclass of order-preserving errors. Under 
this class of errors, an input tuple and its target reference tuple are consistent in the 

5 ordering among tokens after each input token is mapped to the closest matching reference 
token, and each input token is transformed to its counterpart in the reference tuple. Let 
ui,. . .,Um be the list of tokens in the input tuple u ordered according to their position in u. 
Let vi,...,Vmbe the similarly ordered list of tokens in the reference tuple v. In the class of 
order-preserving errors, for all i, the input token Ui is transformed to the reference token 

10 Vj. Let ed(u, v) denote the total (minimum) number of edit operations for transforming 
each Ui into Vj, normalized by max(L(u), L(v)) where the length L(z) of a tuple z is the sum 
of lengths of tokens Zi,...,Zp in tok(z), i.e., L(z)=Y\^i\, Now rewrite ed(u, v) to highlight 
the implicit weight assignment to the Ui->Vi token-mapping. 

J/ \ ^(w) max(| M/|,|v. I) . (\\ 

max( L(m), L{v)) Y ^(w) 

15 Observe that the Ui^Vi mapping gets a weight proportional to max(|ui|, |vi|)/L(u). 
Therefore, ed implicitly assigns weights to token mappings in proportion to their lengths, 
i.e., longer tokens get higher weights. For example, ^corporation' to 'company' gets a 
higher weight than *boeing' to 'bon' thus explaining why ed matches input tuple 13 (in 
Table 2) with R2 (in Table 1) instead of the correct target Rl. Extensive empirical 

20 evidence from the IR appUcation suggests the superiority of IDF weights to token lengths 
for capturing the notion of token importance. 

The K-Fuzzv Match Problem: 

Given a reference relation R, a minimum similarity threshold c (0 < c < 1), the 

25 similarity fiinction f, and an input tuple u, find the set FM(u) of fuzzy matches of at most 

K tuples fi-om R such that 

(i) fins(u, v) >c, for all v in FM(u) 

(ii) fiiis(u, v) >fiiis(u, v') for any v in FM(u) and v' in R-FM(u) 
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Observe that by setting the minimum similarity threshold c to be zero, the system 
responds to a user interested in all closest K reference tuples. When more than K-i+1 
reference tuples are tied for the i^^, K^*" (i > 1) best fuzzy matches, the system breaks 
ties by choosing an arbitrary subset of the tied reference tuples such that the total number 
5 of returned fuzzy matches is K. 

Given an input tuple u, the goal of the fuzzy match process is to identify the fuzzy 
matches — the K reference tuples closest to u. A naive algorithm scans the reference 
relation R comparing each tuple with u. A more efficient approach practiced in 
accordance with an exemplary embodiment builds an "index" on the reference relation for 
10 quickly retrieving a superset of the target fuzzy matches. Standard index structures like 
B+-tree indexes cannot be deployed in this context because they can only be used for 
exact or prefix matches on attribute values. 

In an exemplary embodiment, during a pre-processing phase, additional indexing 
information for efficiently implementing the fuzzy match operation is performed. The 

15 additional information is stored as a standard database relation, and an index for this 
relation using standard B+-trees provides fast, exact lookups. This indexed relation is 
referred to as the error tolerant index (ETI). A challenge is to identify and to effectively 
use the information in the indexed relation. The exemplary embodiment derives fi-om fms 
an easily indexable similarity function fms"^^^ with the following characteristics, (i) fins^^"" 

20 upper bounds fins with high probability, (ii) The system can build the error tolerant index 
(ETI) relation for efficiently retrieving a small candidate set of reference tuples whose 
similarity with the input tuple u, as per fins*^'^, is greater (probabilistically) than the 
minimum similarity threshold c. Therefore, with a high probability the similarity as per 
fms between any tuple in the candidate set and u is greater than c. From this candidate set, 

25 the system returns the K reference tuples closest to u as the fuzzy matches. 

A pproximation of fms 

To enhance efficiency, one goal of the exemplary embodiment is to derive fins*^'^ 
an approximation of the fuzzy match similarity function fins for which the system can 
30 build an indexed relation, fins^'''^ . This is a pared down version of fms obtained by (i) 
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ignoring differences in ordering among tokens in the input and reference tuples, and (ii) 
by allowing each input token to match with the "closest" token from the reference tuple. 
Since disregarding these two distinguishing characteristics while comparing tuples can 
only increase similarity between tuples, fins^^^ is an upper bound of fins. 

5 The tuples [boeing company, Seattle, wa, 98004] and [company boeing, Seattle, 

wa, 98004] which differ only in the ordering among tokens in the first field are 
considered identical by frns^^'^. In fins^^^, the system measures the closeness between two 
tokens through the similarity between sets of substrings — called q-gram sets — of tokens 
(instead of edit distance between tokens used in fms). Further, this q-gram set similarity is 

10 estimated accurately by the commonality between small probabilistically chosen subsets 
of the two q-gram sets. This property can be exploited to build an indexed relation for 
frns^^'^ because for each input tuple the system only has to identify reference tuples whose 
tokens share a number of chosen q-grams with the input tuple. First, define the 
approximation of the q-gram set similarity between tokens. In Lemma 2, (below) this 

15 similarity is related with the edit distance between tokens using an "adjustment term" 
which only depends on the value of q introduced below. 

Q-gram Set: Given a string s and a positive integer q, the set QGq(s) of q-grams of s is the 
set of all size q substrings of s. For example, the 3-gram set QGsC'boeing") is the set of 
substrings {boe, oei, ein, ing} . Because q is fixed to be a constant, it is convenient to use 
20 QG(s) to denote QGq(s). 

Jaccard Coefficient: The Jaccard coefficient sim(Sj, S2) between two sets Si and S2 



Min-hash Similarity: Let U denote the universe of strings over an alphabet and 
hi.'U-^N, i = 1,.. .,H be H hash functions mapping elements of U uniformly and randomly 
25 to the set of natural numbers N. Let S be a set of strings. The min-hash signature mh(S) of 
S is the vector [mhj(S), mhH(S)] where the coordinate mhi(S) is defined 
^^mh^iS) = argmin h^ia) • Let I[X] denote an indicator variable over a boolean X, i.e., I[X] = 

aeS 

1 if X is true, and 0 otherwise. Then, 
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E[sim(S,.S,)] = ^f^I[mh,{S,)=mh,{S,)] 

Computing a min-hash signature is described in A. Broder. "On the resemblance and 
containment of documents." In Compression and Complexity of Sequences 
(SEQUENCES '97), 1998. and E. Cohen. "Size estimation framework with appHcations 
5 to transitive closure and reachability." Journal of Computer and System Sciences, 1997. 

The process of selecting an element in the min-hash signature is like throwing 
darts at a board and stopping when an element of S. is hit. Hence, the probability that an 
element in Si n S2 is hit before another element in SiU S2 is equal to sim(Si,S2). Now 
define token similarity in terms of the min-hash similarity between their q-gram sets. Let 
10 q and H be positive integers. The min-hash similarity simjnh(th t2) between tokens ti and ii 
is: 

Sim (/, I{mh, (QG {/, )) = mh, (QG (t, ))] 

Using the similarity function fins^*'^ , it can be shown that its expectation is greater than 
fins, and (ii) the probability of fins^^'^ being greater than fins can be made arbitrarily large 
15 by choosing an appropriate min-hash signature size. 

Definition of fins^^^ : 

Let u, V be two tuples, and let dq = (1-1/q) be an adjustment term where q is the 
size of q-gram length. 

tuple Rl in Table 1. Suppose q=3 and H=2. We use the notation t:w to denote a token 
with weight w. Suppose the tokens and their weights in 14 are company:0.25, beoing:0.5, 
seattle:1.0, 98004:2.0; their total weight is 3.75. Suppose their min-hash signatures 130 - 
134 (Figure 4) are [eoi, ing], [com, pan], [sea, ttl], [980, 004], respectively. The tokens in 
25 Rl are boeing, company, Seattle, wa, 98004. Suppose their min-hash signatures are [oei, 
ing], [com, pan], [sea, ttl], [wa], [980, 004], respectively. Then, 'company' matches with 
'company', *beoing' with 'boeing', 'seattle' with 'seattle', '98004' with '98004'. The 
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score from matching *beoing' with *boeing' is: w(beoing)*(%*0.5 + (l-l/3))=w(beoing). 
Since every other token matches exactly with a reference token, fhis^^^(I4, Rl) = 
3.75/3.75. hi contrast, fins(I4, Rl) will also consider the cost of reconciling differences in 
order among tokens between 14 and Rl, and the cost of inserting token *wa'. Hence, 
fins(I4, Rl) is less than fms"P^(I4, Rl). 

Lemma 1: Let o < <5 <!,£•> o,// > 25''^ log s'' . Then 

(i) E[fms'P^(u, v)] >fms(u, v) 

(ii) P^fms"'' (m, v) ^ (1 - 5)fmsiu, v)) <. e 

Sketch of Proof: Using the following definitions. 

/i(",v) = ^X X (l-erf(/,r)) 
f^i^u,vy—y y w(0- Max (^./;;i(eG(0,e(:?(r))-H/ J Result (i) falls out of the following 

sequence of observations. 

(i) Ignoring the ordering among tokens while measuring f, and allowing tokens to be 
replaced by their best matches always results in over estimating fins. Therefore, fi(u, 

15 v) >fins(u, v). 

(ii) Edit distance between strings is approximated by the similarity between the sets of 
q-grams (Lemma 4.2 below), and max(|t|, |r|) >|QG(t) U QG(r)|/2. Hence, f2(u, v) > 
fi(u,v). 

(iii) Min-hash similarity between tokens is an unbiased estimator of the Jaccard 
20 coefficient between q-gram sets of tokens. Therefore, E[fms^^^(u, v)] = f2(u, v) > 

fins(u, v). 

Since E[fms^P''(u, v)] = f2(u, v) >fms(u, v) for all H > 0, splitting fms'P''(u, v) into the 
average of H independent functions fi', fu one for each min-hash coordinate such 
that fi' has the same expectation as fms^^^ and using Chemoff bounds, one has the 
25 following inequality, which yields Result (ii). 

«y-///2(".v) 

E[X <i\-S)f(u,v)]<E[X <(l-^)£[^])<e ^ 

Lemma 2: Let ti, t2 be two tokens, and m = max(|ti|, |t2|). Let d = (l-l/q)'(l-l/m). Then, 



MS 301555.1 final 



Because the probability P(fms^^^(u, v) >(1-^ )frns(u, v)) can be increased arbitrarily, it is 
appropriate to say that fins^^'^ upper bounds fins. 

The Error Tolerant Index (ETI) 
5 A primary purpose of the error tolerant index ETI is to enable, for each input tuple 

u, the efficient retrieval of a candidate set S of reference tuples whose similarity with u 

(the input tuple) is greater than the minimum similarity threshold c. The value of 

fins^^'^(u, v) is measured by comparing min-hash signatures of tokens in tok(u) and tok(v). 

Therefore, to determining the candidate set, the system must efficiently identify for each 

10 token t in tok(u), a set of reference tuples sharing min-hash q-grams with that of t. 
Consider the example input tuple [Beoing Company, Seattle, WA, 98004] shown in 
Figure 4. The topmost row in the figure lists tokens in the input tuple, the next row lists 
q-gram signatures 130-134 of each token. The lowest row lists sets (Si through S9) of 
tuple identifiers or tids of reference tuples with tokens whose min-hash signatures contain 

15 the corresponding q-gram. For example, the set Si U S2 is the set of tids of reference 
tuples containing a token in the Org. Name column that shares a min-hash q-gram with 
*beoing'. Extending this behavior to q-gram signatures of all tokens, the union of all Si's 
contains the candidate set S. In order to identify such sets of token identifiers or tids, the 
system stores in the ETI each q-gram s along with the list of all tids of reference tuples 

20 with tokens whose min-hash signatures contain s. 

To formally describe the ETI and its construction, let R be the reference relation, 
and H the size of the min-hash signature. ETI is a relation with the following schema: 
[QGram, Coordinate, Colunan, Frequency, Tid-list] such that each tuple e in ETI has the 
following semantics. e[Tid-list] is a list of tids of all reference tuples containing at least 
25 one token t in the field e[Column] whose e[Coordinate]-th min-hash coordinate is 
e[QGram]. The value e[Frequency] is the number of tids in e[Tid-list]. Constructing a 
. tuple [s, j, i, fi-equency, tid-list] in ETI requires that the system know the list of all 
reference tuple tids containing i^*^ colunm tokens with s as their j^^ min-hash coordinate. 
One method of computing all ETI tuples in main-memory, by scanning and processing 



MS 301555.1 final 



19 



each reference tuple, is not scalable because the combined size of all tid-lists can easily be 
larger than the amount of available main memory. To build the ETI efficiently, the system 
leverages the underlying database system by first building a temporary relation called the 
pre-ETI with sufficient information and then construct the ETI relation from the pre-ETI 
relation using SQL queries. 

The schema of the pre-ETI is: [QGram, Coordinate, Column, Tid]. The system 
scans the reference relation R processing each tuple v as follows. It tokenizes v, and for 
each i^^ column token t in tok(v), it determines its min-hash signature mh(t) of size H. It 
then inserts into a pre-ETI relation a row [q, j, i, r] for the j**^ min-hash coordinate in 
mh(t). For example, if the size-2 signature of the token 'company' belonging to column 1 
of the tuple Rl is [com, pan], then the system inserts the rows [com, 1, 1, Rl], [pan, 1, 1, 
Rl] into the pre-ETI. In practice, such insertions can be batched. 

All tuples required to compute any one ETI tuple occur together in the result of 
15 the pxQ-ETI-query: "select QGram, Coordinate, Column, Tid from pre-ETI order by 
QGram, Coordinate, Column, Tid." The system scans the result of the pre-ETI-query, 
and for a group of tuples corresponding to the q-gram s which occurs as the j^^ min-hash 
coordinate of (multiple) tokens in the i**^ column, the system inserts the tuple [s, j, i, 
freq(s, j, i), tid-list] in ETI, where freq(s, j, i) is the size of the group, and tid-list the list 
20 of all tids in the group, q-grams whose frequencies are above a large threshold, called the 
stop q-gram threshold (set to 10000 in one implementation), are considered stop tokens. 
For such q-grams, the system inserts a NULL value in the tid-list column. Finally, a 
clustered index is built on the [QGram, Coordinate, Column] attribute combination of the 
ETI relation so that queries looking up ETI on [QGram, Coordinate, Column] 
25 combinations are answered efficiently. 

An example ETI relation for the reference relation in Table 1 with q=3 and H=2 is 
shown in Table 3 and Figure 6A. If the length of a token is less than q, then it is assumed 
that its min-hash signature consists of the token itself The tuple [Rl, Boeing Company, 
Seattle, WA, 98004] in Table 1 with min-hash signatures {[oei, ing], [com, pan], [sea, 
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ttl], [wa], [980, 004]} for its tokens, respectively, has the tid Rl in the tid-Hsts of each of 
these q-grams. 



Table 3: An Example ETI Relation 



Q-gram 


Coordinate 


Column 


Frequency 


Tid-list 


oei 


1 


1 


1 


{Rl} 


ing 


2 




1 


{Rl} 


com 


1 




2 


{Rl,R3} 


pan 


2 


^ 


2 


{R1,R3} 


bon 


1 




1 


{R2} 


orp 


1 




1 


{R2} 


ati 


2 




1 


{R2} 


sea 


1 


2 


3 


{R1,R2,R3} 


ttl 


2 


2 


3 


{R1,R2,R3} 


wa 


1 


3 


3 


{R1,R2,R3} 


980 


1 


4 


3 


{R1,R2,R3} 


004 


2 


4 


1 


(Rl) 


014 


2 


4 


1 


{R2} 


024 


2 


4 


1 


{R3} 



Query Processing 

Figure 5 depicts steps for a process that performs fuzzy match queries — queries 
asking for K fuzzy matches of an input tuple u whose similarities (as per fms) with u are 
above a minimum similarity threshold c. The goal is to reduce the number of lookups 
against the reference relation by effectively using the ETI of table 3. A first embodiment 
uses a basic algorithm, which fetches tid-lists by looking up from the ETI all q-grams 
found in min-hash signatures of all tokens in u. An alternate embodiment uses an 
optimization called optimistic short circuiting, which exploits differences in token 
weights and the requirement to fetch only the K closest tuples to significantly reduce the 
number of ETI lookups. For efficient lookups, it is assumed that the reference relation R 
(Table 1) is indexed on the Tid attribute, and the ETI relation is indexed on the [QGram, 
Coordinate, Column] attribute combination. 
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The Figure 5 process for performing the fiizzy match query, starts with the receipt 
210 of an input tuple u. The process of identifying matches involves updating a table 300 
(Figure 6B) of tuple id's and score for those id's and at a step 212 this table is initialized. 
For each token t in tok(u), the process needs to compute its IDF weight w(t), which 

5 requires the frequency of t. These frequencies could be part of the ETI and could be 
fetched by issuing a SQL query for each token. However, assume that frequencies of 
tokens can be quickly looked up from a main memory cache called a token-frequency 
cache that is based on the reference table (table 1). The Figure 5 process determines 214 
the min-hash signature mh(t) of each token t. (If |t| <q (the q-gram size), the process 

10 defines mh(t)=[t].) Next the process assigns 216 the weight w(t)/|mh(t)| to each q-gram in 
mh(t). In the example below the size of mh(t) is 2 or the size of the vector produced. 

Using the ETI, the process determines 220 a candidate set S of reference tuple 
tids whose similarity (as per fms^^^ and hence fms) with the input tuple u is greater than c. 
The process then fetches fi-om the reference relation all tuples in S to verify 240 whether 
15 or not their similarities with u (as per fms) are truly above c. Among those tuples which 
passed the verification test, the process retums 250 the K tuples with the K highest 
similarity scores. 

Candidate Set Determination 

20 The exemplary system computes the candidate set S as the union of sets Sk, one 

for each q-gram q^ in the min-hash signatures of tokens in tok(u). For a q-gram qk which 
is the i^^ coordinate in the min-hash signature mh(t) of a token t in the j^^ column, Sk is the 
tid-list from the record [qk, i, j, fi:eq(qk, i, j), Sk] in ETL Observe that the lookup for [qk, i, 
j, freq(qk, i, j), Sk] is efficient because of the index on the required attribute combination 

25 of ETL Each tid in Sk is assigned a score that is proportional to the weight w(t) of the 
token t in the reference table. If a tuple with tid r is very close to the input tuple u, then r 
is a member of several sets Sk and hence gets a high overall score. Otherwise, r has a low 
overall score. Tids that have an overall score greater than w(u) c minus an adjustment 
term applied at the step 230 — a correction to approximate the edit distance between 

30 tokens with the similarity between their q-gram sets — ^make up the candidate set. 
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During the process of looking up tid-lists corresponding to q-grams, the system 
maintains the scores of tids in these tid-Usts in the hash table 300. At any point, the score 
of a tid equals the sum of weights of all q-grams whose tid-lists it belongs to. The weight 
w(qif) assigned to a q-gram qk in the min-hash signature mh(ti) of a token tj is 

5 w(ti)/|mh(ti)|. If a tid in Sk is already present in the hash table 300, then its score is 
incremented by w(qk). Otherwise, the process adds the tid to the hash table with an initial 
score of w(qk). After all q-grams in the signatures of input tokens are processed, the 
process selects 230 a tid r and add it to the candidate set S only if its score is above w(u)*c 
(minus the adjustment term). The adjustment term is a correction term to approximate 

10 the edit distance between tokens with the similarity between their q-gram sets. It is equal 
to the sum over all input tokens t of d (as defined in Lemma 2) times the weight of the 
input token. 

An optimization to the after-the-fact filtering of tids with low scores described 
above is to add a tid to the hash table 300 only if the score it can potentially get after all 
15 min-hash q-grams are processed is greater than the threshold. We add a new tid to the 
hash table only if the total weight, which is an upper bound on the score a new tid can get, 
of all min-hash q-grams yet to be looked up in the ETI is greater than or equal to w(u)-c. 
This optimization is applied at a step 224 and significantly reduces the number of tids 
added to the hash table. 

20 The procedure of Figure 5 is illustrated with the example input tuple II in Table 2 

and the ETI in Table 3 along with the hash table 300 depicted in Figure 6B. Suppose q=3 
and H=2. Use the notation [ql, q2]: w to denote the min-hash signature [ql, q2] with 
each q-gram assigned a weight of w. The tokens and their weights in II are beoing: 0.5, 
company: 0.25, Seattle: 1.0, wa: 0.75, 98004: 2.0; their total weight is 4.5, Suppose their 

25 min-hash signatures are [eoi, ing]:0.25, [com, pan] :0. 125, [sea, ttl]:0.5, [wa]:0.75, [980, 
004]: 1.0. The system accesses the ETI (table 3) to fetch the following tid-lists: [{}, 
{Rl}], [{Rl, R3}, {Rl, R3}], [{Rl, R2, R3}, {Rl, R2, R3}], [{Rl, R2, R3}], [{Rl, R2, 
R3}, {Rl}]. For the purpose of this example, ignore the adjustment term. Rl gets an 
overall score of 4.25, R2 a score of 2.75, and R3 3.0. Depending on the threshold, the 

30 candidate set is a subset of {Rl, R2, R3}. For the example in Figure 4, suppose the 
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process looked up min-hash q-grams 'eoi', 'ing', *com', 'pan', *sea', While 
processing the q-gram 'wa', the process adds new tids to the hash table only if 0.75 
(weight of wa) + 2.0 (the total weight of the remaining q-grams from the input tuple) is 
greater than w(u) c where w(u) is the total weight of the input tuple. 

5 The basic process of Figure 5 retrieves the correct fuzzy matches with a high 

probability. For the purpose of the formal guarantee in Theorem 1, it is assume that no q- 
gram is classified as a stop token. Alternatively, the stop q-gram threshold is set to at least 
|R|. 

Theorem 1: Let o < J <!,£-> o,// > 2^"Mog f"' . The basic query processing process of, 
10 Figure 5 returns the K reference tuples closest, as per fins, to the input tuple with a 
probability of at least l-e , 

Optimistic Short Circuiting (OSC^ 

Li accordance with the basic process of Figure 5, the process fetches tid-lists by 
looking up ETI of all q-grams in min-hash signatures of all tokens. In one embodiment of 

15 the invention there is a short circuiting optimization to significantly reduce the number of 
ETI lookups. The intuition is as follows. Weights of input tokens (and hence weights of 
min-hash q-grams) often vary significantly. Therefore, the system may look up the ETI on 
just a few important q-grams and — if a fetching test succeeds — optimistically short 
circuit the process by fetching the current closest K reference tuples. If it is able to 

20 efficiently verify — via a stopping test — ^whether these tuples are actually the closest K 
tuples then, a significant amount of work is saved: (i) avoid ETI lookups on a number of 
unimportant q-grams, and (ii) avoid initializing and incrementing similarity scores in the 
hash table for large numbers of tids associated with unimportant, low weight and high- 
frequency q-grams. 

25 Consider the input tuple II, the reference relation in Table 1, and the ETI relation 

in Table 3. Suppose K, the number of tuples sought =1 , q=3, and H=2. The tokens along 
with weights in II are beoing:0.5, company:0.25, Seattle: 1.0, wa:0.75, 98004:2.0; their 
total weight is 4.5. Suppose their min-hash signatures are [eoi, ing]:0.25, [com, 
pan]:0.125, [sea, ttl]:0.5, [wa]:0.75, [980, 004]:1.0. For the purpose of this example, the 
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adjustment terms are ignored. Order q-grams in the decreasing order of their weights, and 
fetch their tid-lists in this order. First fetch the tid-Hst {Rl, R2, R3} of q-gram '980.' The 
process does not distinguish between the K and (K+1)^^ (here, 1^^ and 2"^) best scores. 
So, it fetches the hst {Rl} of the next most important q-gram, that is '004'. At this point, 

5 Rl has the best score of 2.0, and R2 and R3 have scores of 1.0. The process now 
estimates the score for Rl over all q-grams of the input tuple to be, say, 4.5. The estimate 
is obtained by extrapolating (say, using a Unear extrapolation function) the current score 
to the situation when all tokens would have been retrieved. That is, if the total weight of 
all tokens is W, the total weight of tokens looked in the ETI index is W, and the current 

10 score is x, then the estimated overall score is x*W/W'. The best possible score s^next that 
R2 (the current K+1^*^ highest score tid) can get equals its current score plus the sum of 
weights of all remaining q-grams: 1.0+ (4.5-2.0) = 3.5. Observe that s^ncxt is also greater 
than the best possible (K+l)^*^ similarity — as per fms^^^ and hence fms^ — among all 
reference tuples in R. Because 4.5 > 3.5, the process anticipates the reference tuple Rl to 

15 be the closest fuzzy match, fetches it from R, and compute ftns(u, Rl). If fras(u, Rl) > 
3.5/4.5, the process stops and retum Rl as the closest fuzzy match thus avoiding looking 
up and processing tid-lists of the remaining q-grams of the input tuple: eoi, ing, com, pan, 
sea, ttl, wa. However, if fms(u, Rl) <3.5, the process continues fetching the next most 
important q-gram (here 'wa'). 

20 The robustness of the stopping test ensures that inaccuracy in estimating the score 

of Rl over all q-grams does not affect the correctness of the final result. However, it 
impacts performance. If it over-estimates, it may fetch more reference tuples and realize 
they are not good matches, and if it under-estimates then it may perform a higher number 
of ETI lookups. Note, short circuiting is especially effective if only the closest reference 

25 tuple is sought rather than the K closest tuples with K > 1 . 

The query processing algorithm enhanced with optimistic short circuiting {OSC) 
differs from the basic algorithm in two aspects: (i) the order in which the system looks up 
q-grams against ETI, and (ii) the additional short-circuiting procedure that is potentially 
invoke after looking up each q-gram. Pseudo code is almost the same as that in Figure 5 
30 except for two additional steps: the ordering of tokens and the short circuiting procedure. 
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The system orders Q, the set of all q-grams in the min-hash signatures of an input tuple in 
the decreasing order of their weights, where each q-gram s in the signature nih(t) is 
assigned a weight w(t)/|mh(t)|. After fetching tid-list(s) and processing tids in the tid-list, 
it additionally performs the short circuiting procedure (whose pseudo code is shown in 
5 listing 1 below). If the short circuiting procedure returns successfully, the process of 
Figure 5 stops and returns the closest match. 

Listing 1 Short Circuiting 

BOOLEAN ShortCircuit_ETILookups(TidScores, TupleList) 
//FetchingTest(sK, sk+i) 
10 1 Identify K+1 tids r^,. . .,r k+i with the highest similarity scores 

2 Estimate the score over Qp of r k and determine the best possible score s^l over 
Qp of r K+i 

3 If 5^ >5^*> 

4 Fetch R tuples r'l,... A 

15 5 Compare them with u to determine fins(u, r^), . . fhis(u, rV) 
//Stopping Test 

6 If fhis(u, r'j) ^^^ior all j, then assign TupleList = <r\^...yy> and return True; 
else, return false 

20 

The short circuiting procedure consists of a fetching test and a stopping test. The 
fetching test (Step 3 in listing 1) evaluates whether or not the current K tids could be the 
closest matches. On failure, the process returns and continues processing more q-grams. 
If successful, the process fetches the current best K candidates from the reference relation 
25 R (Step 4, listing 1), and compares (using fms) each of them with the input tuple u (Step 
5, Usting 1). The stopping test (Step 6, listing 1) confirms whether or not u is more 
similar to the retrieved tuples than to any other reference tuple. On success, the process 
stops and returns the current K candidate tuples as the best K fuzzy matches. On failure, it 
processes more q-grams. 

30 The fetching and stopping tests are summarized as follows. Let w(Q) denote the 

sum of weights of all q-grams in a set of q-grams Q. Let Qp=[qi,. . .,qp] denote the ordered 
list of q-grams in min-hash signatures of all tokens in the input tuple u such that w(qi) > 
w(qi+i). Let Qi denote the set of q-grams [qi,. . .., qi]. Let ssi(r) denote the similarity score 
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of the tid r plus the adjustment term after processing tid-Usts of qi,...,qi. Suppose 
r'l,. . .,r K, r K+i are the tids with the highest K+1 similarity scores after looking up q-grams 
qi,...,qi. Informally, the fetching test returns true if and only if the "estimated overall 
score" of r^ is greater than the "best possible overall score" of r k+i- We compute the 
5 estimated overall score of r k by linearly extrapolating its current similarity score ssi(r k) 
to ssi(r K)'w(Qp)/w(Qi), and the best possible overall score of r k+i by adding the weight 
(w(Qp)-w(Qi)) of all q-grams yet to be fetched to ssi(r k+i). 

False. Otherwise 

The stopping test returns successfiiUy if fins(u, r^j) >ssi(r.K+i) + w(Qp)-w(Qi), for 
10 all 1 <j <K. Since ssi(r k+i) + w(Qp)-w(Qi) is the maximum possible overall score any 
candidate outside the current top K candidates can get, if the similarities (as per fins) are 
greater than this upper bound we can safely stop because we are sure that no other 
reference tuple will get a higher score. The following theorem (whose proof we omit) 
formalizes the guarantees of the algorithm. Again, for the purpose of obtaining the formal 
15 guarantee, we assume that no q-gram is classified as a stop token. 

Theorem 2: Leto < ^ < i,^ > o, // > 2^ ' log ^ . The query processing algorithm enhanced 
with optimistic short circuiting returns the K reference tuples closest according to fins to 
the input tuple with probability at least 

Resource Requirements 

The expensive steps of the ETI building phase are: (1) scan of the reference 

relation R, (2) writing the pre-ETI, (3) sorting the pre-ETI, and (4) writing the ETL The 

total I/O cost during these phases is 0(mavg'q H'|R| + |ETI|' (12+q)) where mavg is the 

average number of tokens in each tuple, and |ETI| is the number of tuples in ETI which is 

less than H*n-|2^ — ^the maximum number of q-grams times H times the number of 

columns in R — given that Dis the alphabet over which tokens in R are formed fi-om. 

The expensive steps for processing an input tuple are: (1) looking up ETI for tid-lists of 
q-grams, (2) processing tid-lists, and (3) fetching tuples in the candidate set. The number 
of ETI lookups is less than or equal to the total number of q-grams in signatures of all 
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tokens of a tuple. On average, this number is niavg'H. The number of tids processed per 
tuple and the size of the candidate set is bounded by the sum of frequencies of all q-grams 
in the signatures of tokens in a tuple. In practice, the candidate set sizes are several orders 
of magnitude less than the above loose upper bound. Due to its dependence on the 
5 variance of token weights of input tuples, the reduction in the number of ETI lookups due 
to OSC is hard to quantify, 

Token-Frequency Cache 

hi accordance with an exemplary embodiment, the frequencies of tokens are 
maintained in a main memory token-frequency cache enabling quick computation of IDF 

10 weights. Given current main memory sizes on desktop machines, this assumption is valid 
even for very large reference relations. For example, a relation Customer [Name, city, 
state, zip code] with 1.7 milUon tuples has approximately 367,500 distinct tokens (even 
after treating identical token strings in distinct columns as distinct tokens). Assuming that 
each token and its auxiliary information (4 bytes each for column and frequency) together 

15 require on average 50 bytes, the exemplary embodiment only requires 18.375 MB for 
maintaining frequencies of all these tokens in main memory. In those rare cases when the 
token-frequency cache does not fit in main memory, other strategies are possible. 

Cache without Collisions 

An alternative system jeduces the size of the token-frequency cache by mapping 
20 each token to an integer using a 1-1 hash function (e.g., MD5 [21]). The system now only 
require 24 bytes of space (as opposed to a higher number earlier) for each token: the hash 
value (16 bytes), the column to which it belongs (4 bytes), and the frequency (4 bytes). 
Now, the token-frequency cache for the 1.7 million tuple customer relation requires only 
around 10MB. 

25 Cache with Collisions 

Another altemative is to restrict the size of the hash table to at most M entries 
allowing multiple tokens to be collapsed into one bucket. The impact on the accuracy and 
correctness of the frizzy matching algorithm depends on the collision probability. The 
more the collisions, the more hkely the system will compute incorrect token weights. 
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Alternative Extensions 

Indexing Using Tokens 

The previously described embodiments can now extend use of the ETI and the 
fuzzy match query processing algorithm to effectively use tokens for further improving 

5 efficiency. Consider the input tuple II [II, Beoing Company, Seattle, WA, 98004] in 
Table 2. All tokens except 'beoing' are correct, and this characteristic of most tokens in 
' an input tuple being correct holds for a significant percentage of input tokens. Tokens are 
higher level encapsulations of (several) q-grams. Therefore, if in building the ETI the 
process also indexes reference tuples on tokens, the process can directly look up ETI 

10 against these tokens instead of several min-hash signatures thus potentially improving 
efficiency of the candidate set retrieval. However, a challenge of this embodiment is to 
ensure that the candidate set that is fetched contains all K fuzzy matching reference 
tuples. If the system does not look up ETI on the q-gram signature of a token, say 
'beoing', it may not consider reference tuples containing a token 'boeing' close to 

15 'beoing'. And, it is possible that the closest fuzzy match happens to be the reference tuple 
containing 'boeing'. The challenge is to gain efficiency without losing accuracy. 

An alternate approach is to split importance of a token equally among itself and its 
min-hash signature by extending the q-gram signature of a token to include the token 
itself, say, as the 0^^ coordinate in the signature. The extension modifies the similarity 

20 function fms^^'^ resulting in fms Under the broad assumption that all tokens in an 
input tuple are equally likely to be erroneous, the new approximation fms -^^'^ resulting 
from the modification of the token signature is expected to be a rank-preserving 
transformation of fms^^''. That is, if vi and are two reference tuples, and u an input 
tuple then E[fms'P^(u, vO] > E[fms'P'(u, V2)] implies E[fms'-'P"(u, vi)] > E[fms-"P' (u, 

25 V2)]. Consequently, the fuzzy matches identified by using fins^^^^^ are the same as that 
identified by using fhis^^^. Efficiency is gained without losing accuracy. The following 
Lemma formally states this result. 

Definition of fms Let u be an input tuple, v be a reference tuple, t and r be tokens, q 
and H be positive integers. Define 
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2. ti i 

fms'-^(u,v)='^ X ^0)* Max i^sim\,{Ur)^d) 

Lemma 3 : If the probability of error in an input token is a constant p (0 < p < 1), then 
fins is a rank-preserving transformation of fins^''^. 

5 The construction of the ETI index relation has to be modified to write additional tuples of 
the form [token, 0, column, tid-list]. 

Column Weights 

The system can be extended to assign varying importance to columns while 
matching tuples. Let Wi,...,Wn be the weights assigned respectively to columns Ai, 

10 An such that Wi+...+Wn = L A higher Wi value exaggerates the contribution due to 
matches and differences between attribute values in the i^^ column to the overall similarity 
score. The only aspect that changes is that of weights assigned to tokens during the query 
processing algorithm. Now, a token t in the i^^ column gets a weight w(t) Wi where w(t) is 
the IDF weight and Wi is the column weight. The fuzzy match similarity function, the 

15 ETI building algorithm, and the rest of the query processing algorithm remain unchanged. 

Token Transposition Operation 

The exemplary fuzzy match similarity function may also consider additional 
transformation operations while transforming an input tuple to a reference tuple. Consider 
one such operation: the token transposition operation which re-orders adjacent tokens. 

20 Token transposition: Let u[r, ai,...,an] be an input tuple. The token transposition 
operation transforms a token pair (ti, t2) consisting of two adjacent tokens in tok(ai) where 
ii follows ti into the pair (t2, ti). The cost is a function (e.g., average, min, max, or 
constant) g(w(ti), w(t2)) of the weights of ti and ti. Because the token transposition 
operation only transforms the ordering among tokens the resulting similarity is still less 

25 (probabilistically) than fins^^^. Therefore, all the analytical guarantees of the fuzzy 
matching algorithm are still valid when it includes the token transposition operation. 

Although various alternate embodiments of the invention have been described 
with a degree of particularity, it is the intent that the invention include all modifications 
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and alterations from the disclosed designs falling within the spirit or scope of 
appended claims. 
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